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ABSTRACT 

la  the  drat  of  three  papers  in  this  report,  we  describe  a  parallel  model  designed  to 
solve  a  class  of  relatively  simple  problems  from  elementary  physics,  and  discuss  the 
implications  for  models  of  problem  solving  in  general.  We  show  how  one  of  the 
most  salient  features  of  problem  solving,  sequentiality,  can  emerge  naturally  within  a 
parallel  model  that  has  no  explicit  knowledge  of  how  to  sequence  analysis.  This 
model  exploits  a  new  type  of  parallel  distributed  processing  that  employs  stochastic 
processors  and  is  based  on  a  formal  mapping  between  parallel  computation  and  ther¬ 
mal  physics.  The  mathematical  theory  of  this  type  of  processing— harmony  theory— 
is  discussed  in  the  second  and  third  papers. 
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A  PARALLEL  MODEL  OP  (SEQUENTIAL)  PROBLEM  SOLVING 


Natan  af  Roles  and  Their  Interaction 

This  paper  Is  concerned  with  the  nature  of  the  rules  involved  in  solving  problems  and  the 
interaction  among  those  rules.  We  describe  a  parallel  model  derigned  to  solve  a  class  of  relatively 
ample  problems  from  elementary  physics  and  discuss  its  implications  for  models  of  problem  solving  in 
general.  We  show  how  one  of  the  moat  salient  features  of  problem  solving,  sequentiality,  can  emerge 
imtmeUy  within  a  parallel  model  that  has  no  explicit  knowledge  of  how  to  sequence  analysis. 

Consider  the  problem  shown  in  Figure  1.  The  task  is  to  determine  the  qualitative  effects  of 
increasing  the  resistance  of  on  other  circuit  values,  assuming  the  applied  voltage  and  resistance  of  Jtj 
remain  unchanged. 

A  common  approach  to  meddling  the  process  of  solving  problems  like  these  is  to  assume  that 
knowledge  is  organised  as  a  production  system,  similar  to  that  shown  in  Table  1  (see  Riley,  1984,  for  a 
review).  Here  the  model’s  rules  for  making  inferences  are  in  the  form  of  condition-action  pairs,  or 
pradaetbav.  The  condition  specifics  the  particular  elements  and  relations  that  must  be  present  in  the 
data  base  in  order  for  the  condition  to  be  true.  When  the  production  system  is  solving  a  problem,  the 
conditions  of  the  various  productions  are  tested  in  order  until  one  of  them  is  true;  the  action  of  that 
production  is  then  performed.  The  action  generally  makes  some  chsnge  far  the  data  baae  which  in  turn 
means  the  condition  of  a  different  production  wffl  be  tine,  causing  another  action  to  be  performed. 

Since  production  systems  are  univcnal  computers,  they  can  be  programmed  to  display  any 
behavior  (Newell,  1981).  However  certain  kinds  of  behavior  can  be  achieved  with  other  styles  of 
computation  in  more  economical,  elegant,  extendible  and  natural  ways.  Features  that  are  intrinsic  to, 
or  naturally  incorporated  within,  a  pure  production  system  approach  are: 

1)  Seqaentidity:  each  action  taken  utilizes  the  knowledge  contained  in  precisely  one  rule. 

2)  Directionality:  the  knowledge  encoded  in  each  rule  has  a  distinct  directionality  from  input 
(condition)  to  output  (action). 

3)  Exact  matching:  each  rule  acts  only  when  a  perfect  match  to  its  condition  occurs. 

4)  Determlaim:  performance  will  be  identical  on  all  solutions  of  a  given  problem. 

Within  the  production  system  approach  it  is  difficult  to  naturally  avoid  certain  difficulties: 

1)  Lack  of  robaatneu  maier  degradation  of  ratea  (either  removal  of  correct  rules  or  addition  of 
incorrect  ones). 

2)  Lack  of  robaatneu  axdor  Ul for  mod  problemt  that  contain  inconsistent  or  insufficient  given 
information. 

3)  Lack  of  amiability  in  routes  to  correct  answers  or  in  correctness  of  answers;  a  problem  for 
modelling  human  behavior. 

4)  Need  for  explicit  conflict  resohaion  ralot  that  determine  which  rule  will  apply  when  several 
have  true  conditions. 

The  parallel  distributed  processing  approach  represented  by  our  model  naturally  avoids  these 
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Tablet 

A  Simple  Production  System  tor  Solving  the  Problem  in  Figure  1. 
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difficulties,  bat  turn  its  own  problems,  as  me  shall 


Tha  Model 


Oar  model  has  been  constructed  within  the  framework  of  harmony  theory  (Smolensky,  1983, 
1984).  Rales  are  represented  as  a  collection  of  nodes  in  a  network,  ss  shown  in  Figure  2.  A  typical 
role  is  </  down,  Vi  down,  &i  «w>;  this  rale  states  that  the  combination  of  changes  'voltage  across 
Xt  down,  current  goes  down,  Jtt  stays  the  same*  is  a  consistent  set  (Ohm’s  Law).  In  fact,  the  rules 
consist  precisely  of  all  allowed  combinations  of  qualitative  changes  in  circuit  variables  that  are 
consistent  with  each  circuit  law.  There  are  63  sack  instances.  1 


Unlike  condition-action  rules,  there  is  no  directionality  associated  with  the  variables  in  the 
harmonium  rules. 


In  a  particular  problem,  only  some  of  the  instances  represented  by  harmonium  rules  ate  relevant. 
To  represent  this,  each  rule  node  has  an  aedvmUm  vdw  that  can  be  cither  1  (active)  or  0  (inactive). 

In  addition  to  rule  nodes,  the  harmonium  model  contains  nodes  for  representing  the  problem  in 
terms  of  qualitative  changes  in  circuit  variables.  Some  nodes  have  values  given  by  the  problem 
(8a  V*  same,  same).  The  model’s  answer  is  represented  by  values  assigned  to  the  remaining 

nodes. 


As  shown  in  Figure  2.  there  is  a  connection  between  an  individual  circuit  variable  node  and  each 
rale  involving  it;  this  connection  is  labelled  by  the  appropriate  value  for  that  variable  according  to  that 
rule. 


The  goal  of  processing  is  to  find  a  set  of  rale  nodes  to  activate  and  a  set  of  values  for  circuit 
variables  that  are  consistent  with  thoee  rules.  Search  toward  this  god  is  guided  by  a  measure  of  the 
consistency  between  a  set  of  activated  rules  and  a  set  of  circuit  variables:  this  measure  is  called  the 
harmony  fraction.  The  state  of  highest  harmony  should  be  the  correct  answer  to  the  problem. 


Ftaccasing  is  probabilistic  and  constructed  so  that  at  any  moment,  dm  higher  dm  harmony  of  « 
state,  the  more  probable  U  It.  The  spread  in  this  probability  distribution  is  determined  by  a  system 
parameter  called  the  compmtational  temperatwe  T.  Initially,  all  rules  are  inactive,  the  circuit  variables 
given  by  the  problem  ere  assigned  their  values,  and  the  remaining  circuit  variables  ate  assigned  random 
values.  The  temperature  is  set  to  a  high  value,  and  the  stochastic  search  begins.  Rules  are  activated 
and  deactivated,  circuit  variable  values  are  changed  (except  the  given  ones),  and  states  ate  visited  in 
accordance  with  their  harmony.  As  die  search  continues,  this  temperature  is  lowered,  and  the  system 
displays  km  and  less  randomness,  focusing  in  on  the  states  of  highest  harmony.  After  a  while,  the 
temperature  becomes  very  low,  and  the  search  is  effectively  stopped:  an  answer  has  been  selected. 


Sequentiality  of  deduction  seems  to  be  completely  tacking  from  the  harmonium  model,  although 
it  is  a  vary  salient  feature  of  human  problem  solving.  Just  the  same,  in  cresting  this  model  we  expected 
it  to  display  an  emergent  serial! ty.  If  a  single  circuit  variable  is  monitored  during  the  search,  it  will 
fluctuate  randomly  at  first,  and  eventually  *lock  in'  to  a  mine  that  is  very  resistant  to  change.  The 
temperature  at  which  this  ocean  is  the  'framing  temperature*  for  that  variable.  We  expected  that 
different  variables  would  have  different  f  exing  temperatures,  depending  on  the  problem  situation;  the 
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one  with  highest  freezing  temperature  would  settle  first,  which  would  in  turn  determine  the  value 
selected  for  the  variable  with  the  next  highest  freezing  temperature,  and  so  on. 

In  addition  to  T ,  harmonium  models  have  a  second  global  parameter,  k;  it  is  the  sole  parameter 
in  the  definition  of  the  harmony  function.  When  k  is  near  one,  only  rules  that  match  the  current 
guesses  for  circuit  values  exactly  can  become  active  without  lowering  the  harmony  of  the  state;  for  low 
values  of  k,  approximate  matches  are  sufficient.  Initially,  k  is  small,  approximate  matching  is 
encouraged,  and  many  rules  become  activated;  as  the  computation  proceeds,  k  approaches  one  and  the 
set  of  active  rules  shrinks  toward  the  five  that  exactly  match  the  answer. 

As  the  node  for  each  circuit  variable  freezes  into  a  value,  it  does  so  under  the  influence  of  all  the 
active  rule  nodes  connected  to  it.  Unlike  a  production  system,  matching  for  rules  need  not  be  exact, 
and  several  rules  can  act  at  the  same  time. 

The  harmony  function  we  used,  as  well  as  the  schedule  for  lowering  T  and  raising  k,  are  shown  in 
Figure  3.  A  trial  consisted  of  400  iterations  of  100  node  updates  each;  since  there  are  79  nodes  in  the 
model,  this  corresponds  to  slightly  over  300  updates  of  each  node. 

The  stochasticity  of  the  model  produces  variability  in  the  behavior.  In  a  run  of  30  trials,  the 
correct  answer  was  produced  28  times.  When  the  30  values  the  system  assigned  to  the  circuit  variables 
for  each  of  the  400  iterations  ate  averaged.  Figure  4  results.  In  this  graph,  up  is  represented  by  1  and 
down  by  -1.  Initially,  the  values  tor  all  variables  fluctuate  around  zero;  eventually,  each  goes  towards 
the  correct  value.  The  time  at  which  the  four  decisions  are  made  are  indicated  in  the  last  portion  of 
this  figure,  in  which  the  region  between  i  and  -i  has  been  removed.  The  sequence  of  assignments  is 
X— V„  V2‘,  the  sequence  of  "itferencef  that  emerges  naturally  from  the  parallel  processing  Is 
exactly  the  sane  as  the  sequence  produced  in  a  production  system  model. 

The  harmonium  model  displays  both  types  of  robustness  that  are  difficult  to  achieve  naturally 
with  production  systems.  Since  individual  inferences  are  made  under  the  simultaneous  influence  of 
several  rules,  they  ate  less  vulnerable  to  degradation  of  a  single  rule.  When  inconsistent  information  is 
given  in  a  problem,  the  harmonium  model  finds  the  most  consistent  (highest  harmony)  answer  possible. 
When  insufficient  information  is  given,  the  system  finds  one  of  the  correct  answers,  and  finds  different 
answers  on  different  trials.  Such  a  robust  tendency  to  form  coherent  interpretations  of  inputs  is 
important  both  for  modelling  human  cognition  and  for  building  intelligent  machines. 

Ext  carious 

While  the  parallel  distributed  processing  approach  has  certain  advantages  over  the  production 
system  approach,  it  also  has  grave  disadvantages.  The  most  serious  is  the  difficulty  of  performing 
symbol  manipulation.  Without  variable  binding  mechanisms,  types  and  tokens,  it  is  difficult  to 
imagine  how  to  develop  a  general  model  capable  of  analyzing  a  variety  of  circuits;  our  model  is 
specialized  to  a  single  circuit,  and  even  so  we  must  replicate  the  rules  encoding  valid  instances  of 
Ohm’s  Law  throe  times  (once  for  each  relevant  piece  of  the  circuit). 

It  may  be  psychologically  plausible  to  postulate  a  small  collection  of  networks  like  our 
harmonium  model  (or  perhaps  one  integrated,  larger  network)  incorporating  knowledge  about  similarly 
simple  circuits  (e.g.,  a  circuit  with  two  resistors  wired  in  parallel).  These  could  conceivably  serve  as 
prototypes  that  would  be  invoked  to  deal  with  pieces  of,  or  schematic  versions  of,  larger  circuits. 
However  some  powerful  mechanism  would  still  have  to  coordinate  the  parallel  analyses  of  circuit 
fragments. 
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It  is  tempting  to  use  a  production  system  for  this  coordination,  combining  the  strengths  of  the 
two  approaches.  Such  a  hybrid  model  might  well  be  able  to  analyze  complex  circuits,  but  would 
display  the  production  system  weaknesses  (lack  of  robustness,  and  so  forth)  in  those  aspects  of  the 
analysis  that  were  relegated  to  the  production  system. 

One  interpretation  of  such  a  hybrid  model  is  that  the  production  system  component  is  actually 
just  a  complex  parallel  processing  network  viewed  m  a  higher  level  cf  description;  the  hybrid  is  of 
descriptive  levels— there  are  not  two  independent  processes.  It  is  a  major  goal  of  ours  to  see  if  parallel 
models  are  capable  of  exhibiting  emergent  production-like  behavior;  the  emergent  seriality  of  the 
present  harmonium  model  is  an  example  of  just  such  behavior. 


The  harmonium  model  has  implicit  knowledge  of  circuit  laws  that  enable  it  to  model  naturally  the 
nonsymbolic,  intuitive  component  of  problem  solving  that  is  difficult  to  model  naturally  with 
production  systems  and  is  particularly  salient  in  expertise.  At  the  same  time  it  lacks  the  explicit 
knowledge  of  symbolic  laws  that  most  experts  possess.  Thus  to  model  expert  problem  solving  in 
general,  it  seems  necessary  to  imbed  the  harmonium  model  within  a  hybrid  parallel/production  system 
model.  We  are,  however,  investigating  whether  the  symbolic  component  of  experts’  processing  can  be 
preempted  with  conditions  of  very  short  response  times,  making  such  experimental  conditions 
appropriate  for  testing  the  pure  harmonium  model.  We  are  also  planning  to  study  unschooled 
electronics  experts  to  see  to  what  extent  they  are  free  of  conscious  rule  application  in  their  solution  of 
simple  circuit  problems. 

Much  work  remains  to  be  done  in  analyzing  the  variation  in  the  model’s  performance,  and 
assessing  the  dependence  of  performance  on  the  schedules  for  T  and  a  and  the  representation  of  the 
circuit.  Indeed  it  is  the  development  of  more  powerful  representations  within  the  parallel  distributed 
processing  paradigm  that  is  the  primary  goal  of  harmony  theory;  by  trying  to  enrich  the  knowledge  of 
our  harmonium  model  to  incorporate  more  "symbol-like*  explicit  knowledge  of  circuit  laws,  we  hope 
to  gain  more  insight  into  how  symbol  manipulation  might  emerge  from  parallel  distributed  processing. 
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THE  MATHEMATICAL  BOLE  OF  SELF-CONSISTENCY 
IN  PARALLEL  COMPUTATION 

Analysis  of  Etrpat  Properties  «f  Neural  Systems 


One  approach  to  the  mind/body  problem  is  to  view  the  description  of  mind  as  a  higher  level 
description  of  brain — to  view  psychological  principles  as  emergent  properties  of  neural  systems. 
Certainly  before  such  a  view  can  be  scientifically  tested,  a  better  understanding  of  both  brain  and  mind 
must  be  established.  However,  enou^i  is  already  known  about  each  to  make  feasibility  studies 
possible. 

What  methodology  is  capable  of  analysing  the  emergent  properties  of  large  complex  systems  of 
interacting  elements?  One  discipline  where  this  analysis  needs  to  be  done  is  statistical  physics,  where 
Urge  scale  properties  of  matter  me  derived  mathematically  from  the  principles  believed  to  govern  the 
interactions  of  molecular  and  sub-molecular  constituents. 

Is  it  possible  to  apply  similar  kinds  of  mathematical  analysis  to  deduce  emergent  properties  of 
neural  systems?  Although  the  principles  governing  neuronal  interaction  are  by  no  means  as  well 
understood  as  those  governing  particles,  models  that  abstract  some  of  the  characteristics  of  neural 
networks  have  been  studied  for  some  time.  Hbpfieid  (1982)  has  shown  that  with  certain  modifications, 
standard  neural  models  can  be  analysed  with  mathematics  much  like  that  of  statistical  physics,  and  that 
emergent  properties  can  be  analysed. 

One  of  the  central  concepts  in  statistical  physics  is  wupireirt.  The  utility  of  this  concept  in 
performing  difficult  computations  has  been  diown  by  Kirkpatrick  et  al.  (1983).  However  the  most 
important  concept  in  statistical  physics,  as  in  all  branches  of  physics,  is  that  of  energy.  The  meaning  of 
'energy  in  the  computational  contest  is  not  obvious;  rather  than  a  computational  interpretation, 
Hbpfieid  offered  a  general  formula  for  the  'energy*  of  a  neural  net,  while  Kirkpatrick  et  al.  hand 
crafted  'energy*  formulae  for  their  particular  computations. 

The  application  of  statistical  physics  concepts  to  computation  is  now  a  rather  active  field  of  study 
(Hinton  A  Sejnowski,  1983;  Hofstadter,  1983;  Gcmaa  A  Genua,  1983).  To  provide  a  solid  foundation 
for  this  analysis,  what  is  required  in  my  opinion  is  m  imerpretmiom  of  ’energf  thm  emabiitbes  a  deep 
caamcHm  between  the  fermatlm  of  ttmistical  pkytict  mi  rim  central  problem  of  cognition. 


In  this  paper  I  will  present  the  interpretation  of  'energy*  that  lies  at  the  heart  of  a  general 
computational  approach  I  have  been  developing  independently  of  the  work  of  those  interested  in 
neural  nets  or  in  particular  difficult  computations.  In  this  interpretation,  'energy*  it  a  manure  cf  rim 
te!f<mtimacy  tf  e  cempmetimat  mate.  In  place  of  the  term  'energy,'  which  emphasizes  the  physical 
analogy,  or  the  more  technical  term  'Hamiltonian,*  which  serves  only  to  recall  history  and  account  for 
the  physicist's  notation  ft,  I  choose  to  foreground  the  measurement  of  self-consistency  by  using  the 
term  harmony  fmetion,  denoted  H .  The  general  framework,  kmmoay  theory,  is  described  in  Smolensky 
(1984);  an  analysis  of  learning  using  this  theory  is  bepn  in  Smolensky  (1983),  and  an  application  of  the 
theory  to  meddling  qualitative  analysis  of  a  simple  electric  circuit  (with  a  discussion  of  the  model’s 
emergent  properties)  is  described  in  Riley  and  Smolensky  (1984).  In  this  paper  I  will  focus  on  the 
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computational  meaning  of  harmony,  passing  quickly  over  other  aspects  of  the  theory.  The  treatment 
will  be  very  informal;  for  more  formal  presentations  the  reader  is  referred  to  the  previously  cited 
papers. 

The  Rale  of  Harmony  In  Cam  potation 

Before  considering  how  the  harmony  function  is  defined,  we  start  with  a  discussion  of  how  the 
harmony  function  is  need  during  computation.  The  basic  idea  can  be  framed  at  a  very  general  level. 
During  computation,  search  for  an  answer  is  guided  by  a  measure  of  'good new*  of  possible  answers: 
the  harmony  function  H  is  that  measure.  The  search  is  stochastic;  the  computation  is  a  Monte  Carlo 
random  walk  through  the  solution  space  under  the  guidance  of  H.  The  random  walk  is  designed  so 
that  eventually,  the  probability  at  any  moment  of  viaiting  a  point  p  in  the  solution  space  is  given  by 
the  canonical  distribution: 

prob GO  -  NeH{fVr 

N  is  the  constant  needed  to  normalize  the  probabilities  so  that  they  sum  to  one.  7  is  a  global 
parameter  that  determines  the  spread  fai  the  probability  distribution. 

The  canonical  distribution  is  the  only  continuous  relationship  between  H  and  probability  that 
correctly  treats  the  independence  of  components  of  a  computation.  The  canonical  distribution  also 
happens  to  be  the  distribution  on  which  moat  of  statistical  physics  is  based.  (This  is  no  coincidence, 
as  the  notion  of  independent  subsystem  in  physics  maps  onto  that  of  independent  subcomputations.) 
There  is  an  isomorphism  that  maps  the  harmony  function  into  minus  the  Hamiltonian  (energy) 
function,  and  T  into  temperature.  This  suggests  calling  7  the  comptoationd  temperature  of  the  system. 

In  physics,  the  Hamiltonian  determines  what  states  are  most  probable:  the  states  with  lowest 
energy  are  most  probable  at  all  temperatures,  and  states  of  high  energy  have  negligible  probability 
except  at  high  temperatures.  In  harmony  theory,  the  harmony  function  determines  what  states  ate 
most  probable:  the  states  with  highest  harmony  ate  moat  probable  at  all  computational  temperatures, 
and  states  of  low  harmony  have  negligible  probability  except  at  high  temperatures.  7  can  be  thought 
of  as  setting  the  scale  toe  what  constitutes  significant  differences  in  harmony  values.  In  fact,  the  ratio 
of  probabilities  of  two  states  is  ,  where  AH  is  the  difference  in  harmony  between  the  states.  If 
this  difference  is  small  compared  to  7 ,  the  ratio  of  probabilities  will  be  close  to  one;  if  AH  is  large 
compared  to  7 ,  the  state  with  higher  harmony  will  be  many  times  more  probable. 

The  goal  of  the  computation  is  to  find  the  state  of  highest  harmony.  This  means,  in  particular, 
that  the  state  of  neat  highest  harmony  should  be  much  leas  likely.  This  requites  that  7  be  small 
compared  to  the  harmony  difference  between  the  two  highest  levels  of  harmony. 

We  could  simply  set  7  to  be  such  a  low  value  and  be  done  with  it.  However,  this  is  not  a 
practical  search  procedure.  The  Monte  Carlo  procedure  will,  if  let  ran  long  enough,  visit  points  with 
the  probabilities  given  by  the  canonical  distribution.  However,  the  time  required  to  teach  this 
'thermal  equilibrium*  grows  extremely  rapidly  as  7  is  lowered.  A  more  practical  way  of  zeroing  in  on 
the  state  of  highest  harmony  is  to  start  with  a  high  temperature  and  gradually  lower  it.  Early  in  the 
search,  only  large  harmony  differences  are  significant,  and  the  system  quickly  makes  a  crude  cut  at  the 
problem,  avoiding  states  of  extremely  low  harmony.  As  the  system  cools  down,  smaller  harmony 
differences  become  rigni Scant,  and  more  and  more  states  are  avoided  as  the  search  focuses  on  states 
with  harmonies  close  to  the  maximal  value.  If  the  cooling  is  done  gently,  the  state  of 
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harmony  should  be  found  in  much  less  time  than  by  giving  T  a  constant  low  value. 

The  Relation  of  Harmony  to  the  Environment 

We  have  discussed  a  stochastic  search  technique  that  will  find  states  of  high  harmony.  But  how 
do  we  design  the  function  H  so  that  the  states  with  high  H  values  give  the  correct  solutions  to 
problems?  Now  we  must  discuss  the  sense  in  which  H  measures  self-consistency. 

The  'correct*  answer  to  problems  are  often  those  that  satisfy  a  set  of  rules.  In  the  circuit  analysis 
problem  considered  by  Riley  and  Smolensky  (1964),  for  example,  the  rules  are  the  physical  laws  of 
simple  circuits.  Any  system  that  can  correctly  solve  problems  such  as  this  must  in  some  sense  have  a 
representation  of  the  rules.  In  harmony  theory,  the  rules  are  encoded  in  the  harmony  function.  The 
question  is,  how  are  these  rules  encoded,  and  how  can  a  system  develop  an  appropriate  harmony 
function  through  experience? 

Of  course  most  cognitive  tasks  are  not  as  strictly  governed  by  rules  as  is  formal  problem  solving. 
Yet  all  cognition  hinges  on  the  exploitation  of  repdarities  in  the  environment,  even  if  those  regularities 
are  leas  formal  than  Ohm's  Law.  Cognition  enables  organisms  to  do  the  completion  task:  take  some 
limited  information  about  the  current  state  of  their  environment  and  make  reasonable  guesses  about 
what  else  is  likely  to  occur  in  the  environment.  That  is,  given  tome  of  the  features  that  specify  the 
environmental  state,  the  organism  can  make  reasonable  gueacs  about  missing  features. 

In  harmony  theory,  the  'rules”  applied  during  the  completion  task  are  simply  ttmemenu  thee 
certain  feantret  ran  co-occur  in  the  environment.  In  the  circuit  application,  for  example,  in  place  of  a 
symbolic  version  of  Ohm’s  Law,  V  ■  IK,  there  are  many  'ruled'  that  each  record  a  single  combination 
of  qualitative  changes  in  V, I,  and  X  that  me  consistent  with  the  law.  These  'rules'  can  in  fact  be 
thought  of  as  memory  tracet  that  might  be  left  behind  by  individual  experiences  in  the  environment  in 
which  the  regilarities  hold. 

Here  is  the  general  idea  of  how  to  set  up  a  harmony  function  for  performing  the  completion  task 
in  a  given  environment.  Imagine  the  system  experiencing  many  encounters  with  the  environment;  each 
leaves  many  traces  that  each  record  some  of  the  features  that  co-occurred.  When  partial  information 
about  the  current  state  of  the  environment  is  given  in  a  completion  problem,  the  harmony  of  a 
possible  completion  of  that  information  is  the  overall  consistency  between  that  completion  and  the  set  of 
all  tracet.  To  spell  this  out,  we  consider  first  how  the  traces  are  determined  and  then  how  the  'overall 
consistency*  is  computed. 

The  traces  can  be  produced  automatically  by  rimulating  exposure  to  an  environment,  or  they  can 
be  produced  manually  by  the  modeller.  The  latter  technique  was  used  in  the  circuit  problem:  each 
trace  was  chosen  to  be  an  allowed  combination  of  qualitative  changes  in  the  circuit  quantities 
appearing  in  a  single  circuit  law.  The  automatic  generation  of  traces  is  yet  to  be  explored;  the  idea  is 
that  traces  would  be  produced  in  a  random  fashion  (guided  by  the  degree  to  which  potential  traces 
would  enhance  system  harmony);  the  statistical  properties  of  the  resulting  set  of  traces  would  then 
govern  the  emergent  behavior  of  the  system. 

How  is  'the  overall  consistency  between  a  completion  and  the  set  of  all  traces'  computed?  The 
idea  here  is  that  for  each  trace,  a  decision  needs  to  be  made  about  whether  the  instance  it  recorded  is 
relevant  to  the  current  situation.  Borrowing  the  usage  of  schema  theory,  a  match  between  part  of  a 
trace  and  a  completed  set  of  environmental  feat  urea  can  cause  the  trace  to  become  active.  The  'overall 
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consistency* — the  harmony— of  a  completion  is  the  sum  over  all  active  traces  of  a  measure  A  of  the 
degee  of  match  between  the  trace  and  the  completion.  A  simple  definition  of  A  is  the  number  of 
features  in  the  completion  that  match  the  trace,  minus  the  number  that  do  not  match.  (A  slightly 
mote  complicated  definition  of  A  was  used  in  the  circuit  analysis  model.) 

There  are  now  two  kinds  of  variables  used  in  the  computation:  features  of  the  environmental 
state,  and  activation  values  for  traces.  The  processing  has  two  components:  computing  the  harmony 
values  of  possible  completions,  and  making  corresponding  random  decisions  about  which  completions 
to  visit.  Computation  of  the  harmony  value  requires  deciding  which  traces  to  activate,  and  this 
requires  computing  the  quality  of  match  A  between  traces  and  the  completion.  Just  as  the  Monte 
Carlo  search  is  used  to  decide  what  completions  to  visit,  it  can  be  used  to  decide  what  traces  to 
activate.  So  using  the  traces  to  define  the  harmony  of  completions  leads  naturally  to  extending  the 
search  space  to  include  both  environmental  feature  variables  md  trace  activation  values. 

The  Netwsrk  Interpretation:  A  Computer  Impteaaentatlen 

It  is  useful  to  represent  the  computation  by  a  network;  a  portion  of  the  network  for  the  circuit 
model  is  shown  in  Fifsre  1.  The  activation  variables  are  represented  by  nodes  in  the  upper  layer,  each 
corresponds  to  a  trace.  The  environmental  feature  variables  are  represented  by  nodes  in  the  lower 
layer.  There  are  connections  between  a  trace  variable  and  ail  the  environmental  features  it 
incorporates.  For  simplicity,  all  variables  (nodes)  are  taken  to  have  binary  values:  trace  activation 
nodes  have  values  arrive  and  iaacrive;  environmental  feature  nodes  have  values  prewar  and  sharia. 

The  Monte  Carlo  search  in  this  network  representation  proceeds  as  follows.  Initially  a  high 
temperature  T  is  chosen,  all  the  traces  are  sat  inactive,  the  environmental  features  are  permanently 
assigned  their  given  values,  and  the  remaining  environmental  feature  variables  me  assisted  random 
initial  values.  Then  ptoceming  begins.  A  node  is  selected  at  random  (but  not  one  of  the  given 
features).  Neat  the  difference  A H  between  the  overall  network  harmonies  that  would  result  from  the 
two  poaaible  values  for  the  node  la  computed.  In  principle,  this  computation  could  be  performed  in 
die  node  itself,  for  the  only  quantities  needed  am  thorn  to  which  the  node  is  connected.  Finally,  the 
node  randomly  selects  a  new  value,  using  as  the  ratio  of  probabilities  for  the  two  values  «A*/r .  The 
process  of  selecting  a  node  and  selecting  a  value  for  that  node  is  iterated  while  the  temperature  T  is 
gradually  lowered  according  to  some  schedule. 

The  repeated  selection  of  nodes  and  assignment  of  new  values  can  be  viewed  (following  Hopfield) 
as  the  asynchronous  processing  of  proccaaon  located  at  the  nodes  and  tunning  in  parallel.  The  relation 
between  this  parallel  ptoceming  network  and  those  considered  by  Hopfield  rod  Hinton  rod  Sejaowski 
is  that  the  harmony  model  has  a  special  architecture:  there  are  two  dames  of  nodes,  and  connections 
between  but  not  within  the  two  clmam.  The  formula  for  harmony  tuns  out  to  be  minus  that  for 
Hopfleid’s  network  'energy/  taking  into  account  the  special  architecture  rod  the  numerical  assignments 
eerivs  ■  1,  Inactive  ■  0;  prewar  ■  1,  aftwar  ■-!. 


Cam  manta  m  Neural  Implcmentatisn 

Since  harmony  theory  is  computationally-inspired,  rather  thro  a eu rally-inspired,  the  edition 
between  the  harmony  network  and  neural  networks  has  not  been  developed.  However  the  dose 
resemblance  of  the  harmony  network  to  Hopfidd’s  neural  network  might  suggest  that  harmony  nodes 
eonrapond  to  neurons,  so  a  brief  comment  is  appropriate.  While  it  does  not  seem  unreasonable  in 
principle  to  identity  environmental  feature  nodes  with  neurons,  it  is  nor  reasonable  to  identity  trace 
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Trace  Nodes 


Environmental 
Feature  Nodes 


Fignc  1.  A,  portion  of  the  network  representation  of  the  circuit  analysis  model  (from  Riley  ad 
Smolensky,  1984).  (The  values  up,  down,  sane  for  environmental  features  (circuit  variable  changes)  arc 
actually  represented  by  using  two  binary  nodes  for  each  variable.] 
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nodes  with  neurons.  Indeed,  I  imagine  that  each  trace  is  distributed  over  the  synapses  of  the  neurons 
corresponding  to  the  environmental  features  involved  in  that  trace.  'Activation*  of  the  trace  might 
correspond  to  a  feedback-mediated  rapid  enhancement  of  the  strengths  of  these  synapses,  as  in  von  der 
Malsberg  (1961).  In  this  sense,  even  the  activation  of  traces,  a  primitive  operation  in  the  theory  as 
presently  formulated,  may  be  an  emergent  property  of  synaptic  dynamics. 

Even  without  a  precise  specification  of  the  relation  between  harmony  networks  and  neurons, 
harmony  theory  offers  a  mathematical  framework  within  which  to  explore  the  emergence  of  mind  from 
brain-like  processing.  The  isomorphism  between  computation  and  statistical  physics  which  it 
represents  rests  on  the  identification  of  self-consistency— harmony— as  playing  a  central  role 
isomorphic  to  that  played  by  energy  in  physics. 
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This  paper  discusses  a  particular  cUas  of  parallel  distributed  processing  models  of  cognition: 
thermal  models.  These  models  employ  stochastic  processors  and  rdy  on  a  formal  mapping 
between  parallel  computation  and  statistical  physics.  A  special  subclass  of  thermal  models 
is  defined  as  the  implementation- level  deacription  of  a  general  mathematical  framework  for 
studying  cognition:  harmony  theory.  Harmony  theory  is  presented  at  the  computational  and 
algorithmic  levels  as  well  as  the  implementation  level. 
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HARMONY  THEORY: 

THERMAL  PARALLEL  MODELS  IN  A  COMPUTATIONAL  CONTEXT 


Iat  redaction 

In  recent  years  considerable  effort  has  been  directed  at  exploring  computational  architectures 
using  a  large  number  of  fairly  simple  processors  running  in  parallel  and  communicating  with  each  other 
across  a  network  of  links.  This  style  of  computation,  which  I  shall  refer  to  as  ptwalUl  distributed 
processing,  has  variously  been  called  'massively  parallel,*  'connectionistic,*  and  ’neurally-inspired 
These  names  suggest  the  variety  of  disciplines  that  have  found  parallel  distributed  processing  to  be 
important  to  understand.  Neuroscientists  use  the  processors  to  model  neurons;  psychologists  and 
computer  scientists  use  parallel  distributed  processing  to  develop  formal  computational  systems  with 
some  of  the  flexibility,  subtlety,  and  power  of  human  cognition.  (For  references,  see  Anderson  A 
Hinton.  1981.) 

The  past  two  years  have  seen  the  emergence  of  a  new  type  of  parallel  distributed  procemtng  that 
employs  stochastic  processors  and  is  based  on  a  formal  mapping  between  parallel  computation  and 
statistical  physics  (Hopfidd,  1982;  Kirkpatrick,  Gelatt,  A  Vecchi,  1983;  Hinton  A  Sejnowski,  1983a,  b; 
Hofstadter,  1983;  Smolensky,  1983).  I  shall  call  cognitive  models  of  this  type  tkermd  models.  In  this 
piper  I  present  a  particular  clam  of  thermal  models-  I  call  kmmomUm  models,  with  allusion  to 
pmdtmotdum  (Selfridge  A  Neisscr,  1980).  I  will  diacaas  how  harmonium  models  differ  from  other 
thermal  models,  as  well  as  how  thermal  models  differ  from  more  traditional  parallel  distributed 
processing  models,  which  I  will  refer  to  as  activation  models.  Characteristic  features  of  harmonium 
models  include:  an  architecture  that  represents  a  ptoceas/data  distinction;  a  global  mathematical  entity, 
the  harmony  function,  that  drives  the  processing;  stochastic  processors;  a  global  system  parameter,  the 
compumitmal  temperature-,  an  algorithmic  process,  cooling;  and  a  new  type  of  system  behavior,  freezing. 

The  harmony  function  corresponds  to  what  others  who  work  on  thermal  models  have  called  the 
energy  function,  taking  the  term  from  thermal  physics.  1  The  harmony  function,  which  is  central  to  the 
processing  of  harmonium  models,  has  an  interpretation  within  the  general  context  of  the  cognitive 
tasks  that  harmonium  models  are  desipied  to  perform:  it  measures  the  self-consistency  of  a 
computational  state  (Smolensky,  1983,  1964).  This  interpretation  is  what  leads  to  the  differences 
between  harmonium  models  and  other  thermal  models. 

The  central  goal  of  this  paper  is  to  introduce  a  partially-developed  general  analytic  framework  I 
call  hmmony  theory.  Within  this  framework  harmonium  models  emerge  with  a  certain  degree  of 
inevitability  as  the  description  is  pushed  from  the  abstract  to  the  implementation  level.  The 
presentation  will  roughly  follow  Mbit's  (1982)  stratification  of  descriptions  of  computational  devices 
into  the  computational,  algorithmic,  and  implementation  levels.  Another  level,  intermediate  between 
the  computational  and  the  algorithmic,  will  also  be  needed. 


1.  The  harmony  function  actually  corresponds  to  the  energy  function  multiplied  by  -1. 


Smolensky 


2 


Harmony  Theory 


Harmonium  models  of  particular  cognitive  processes  acquire  additional  interest  when  viewed 
within  the  general  context  of  harmony  theory.  The  models  are  not  simply  an  attempt  to  simulate 
human  performance,  or  to  get  a  machine  to  'act  intelligently*;  they  are  also  a  vehicle  for  developing  a 
general  account  of  cognition  that  resides  st  a  higher  conceptual  level  than  that  of  the  particular 
models. 


Computational  Level 

Consider  the  cognitive  tasks  of  constructing  a  three-dimensional  percept  from  two-dimensional 
images,  of  constructing  a  coherent  interpretation  of  a  piece  of  text,  and  of  solving  a  problem  in  a 
formal  domain.  These  disparate  tasks  share  an  underlying  structure:  the  filling  in  of  missing 
information,  using  knowledge  about  which  items  fit  together  in  the  task  environment.  Harmony 
theory  begins  with  this  abstract  'completion  task'  and  a  formalization  of  this  sense  of  'environment,* 
and  gradually  descends  in  abstraction  to  the  level  of  implementable  models  of  specific  cognitive  tasks. 
The  ultimate  goal  of  the  research  is  to  develop  a  precise  characterization  of  a  three  related 
mathematical  structures:  (a)  a  cognitive  system;  (b)  an  environment ;  (c)  the  completion  task.  The  cognitive 
system  possesses  concepts  for  representing  states  of  the  environment,  and  knowledge  about  which 
concepts  fit  together  in  the  environment.  The  goal  is  to  investigate:  for  a  given  cognitive  system,  in  a 
given  environment,  what  set  of  concepts  and  knowledge  relating  them  will  enable  the  system  to  perform  the 
completion  task ? 

To  initiate  this  investigation,  I  formulate  an  appropriate  formalization  of  the  term  environment. 

Our  environment  can  be  viewed  as  a  stream  of  overlapping  episodes  of  all  durations,  starting  at 
all  moments  of  time.  Cognition  enables  organisms  to  predict  with  some  accuracy  what  episodes  ire 
likely  to  result  from  their  actions,  given  the  portion  of  those  episodes  about  which  they  already  have 
knowledge.  What  is  critical  about  the  environment  is  that  different  episodes  ktve  different 
probabilities.  The  basic  cognitive  task  of  the  organism  is  the  prediction  of  likely  episodes  given  some 
partial  knowledge. 

For  my  purposes,  then,  I  adopt  these  definitions.  An  environmen t  is  a  probability  space,  the 
points  of  which  are  called  episodes.  In  the  completion  task,  information  that  partially  specifies  an 
episode  is  given  as  input,  and  the  output  is  a  set  of  the  most  probable  episodes  that  are  consistent  with 
the  input. 

Many  of  the  cognitive  tasks  that  are  studied  in  cognitive  science  can  be  viewed  as  specific 
instantiations  of  the  general  completion  task.  In  the  domain  of  story  understanding,  an  episode  is  a 
sequence  of  events  and  actors’  goals.  The  story  partially  specifies  some  episodes;  'understanding  the 
story*  is  the  completion  of  these  to  full  specifications,  including  omitted  events  and  goals.  The 
collection  of  those  episodes  that  could  possibly  occur  in  our  world,  together  with  their  corresponding 
probabilities,  defines  the  story  understanding  environment.  In  the  domain  of  visual  perception,  an 
episode  is  a  sequence  of  positions  of  objects  in  three-dimensional  space.  Streams  of  two-dimensional 
images  directly  specify  episodes  only  partially,  and  the  job  of  perceptual  processing  is  to  complete 
those  specifications. 

Rather  than  tackle  the  temporal  complexities  of  episodes,  I  will  instead  take  an  environment  to 
be  a  probability  distribution  over  static  entities  called  scenes.  The  completion  task  then  generalizes 
many  interesting  cognitive  tasks  that  are  free  of  time,  such  as  understanding  descriptions  of  static 
scenes  and  processing  of  single,  static  images. 
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In  Riley  and  Smolensky  (1984),  harmony  theory  is  used  to  study  an  interesting  static  task: 
qualitative  analysis  of  the  simple  electric  circuit  shown  in  Figure  1.  The  task  is  to  answer  questions 
like,  "What  happens  to  the  circuit  if  R2  is  increased,  assuming  the  voltage  of  the  battery  and  Jtt  remaiu 
unchanged?*  Here  a  'scene'  is  a  set  of  qualitative  changes  In  circuit  features.  Those  sets  of  changes  that 
are  consistent  with  the  circuit  laws  of  elementary  physics  define  the  scenes  that  have  nonzero 
probability  in  this  environment.  The  given  question  specifies  some  of  the  qualitative  changes  defining  a 
scene,  namely,  the  changes  in  the  resistances  and  in  the  battery's  voltage.  The  task  is  to  complete  this 
to  a  full  specification  of  a  *highly  probable*  scene,  i.e.,  fill  in  the  appropriate  changes  in  the  other 
circuit  quantities  like  currents  and  voltage  drops.  If  the  information  given  in  the  problem  uniquely 
determines  a  'correct  answer,'  then,  given  the  input,  one  scene  has  probability  one  and  the  others  have 
probability  zero. 

The  probability  distribution  for  the  environment  of  elementary  physics  problems  is  artificial; 
there  is  sharp  distinction  between  scenes  that  are  'allowed'  and  those  that  are  not,  i.e.,  between  those 
that  have  nonzero  probability  and  those  that  have  zero  probability.  This  characterizes  a  formal 
environment,  one  that  can  be  delimited  strictly  by  formal  rules  like  the  circuit  laws.  While 
conventional  computers  are  at  home  in  such  environments,  people  are  not;  at  least  one  can  argue  that 
more  training  is  required  for  people  to  perform  well  in  formal  environments  than  in  informal  ones.  A 
central  empirical  question  for  this  approach  is:  What  are  the  properties  of  the  environments  in  which 
natural  cognitive  systems  can  actually  perform  the  completion  task  with  some  accuracy ?  At  this  stage, 
intuition  must  serve  in  place  of  an  empirical  answer.  The  environments  are  most  likely  sparse,  with  a 
huge  fraction  of  the  space  of  all  possible  scenes  having  an  extremely  low  probability.  Furthermore,  we 
perceive  scenes  (in  my  generalized  sense  of  the  word)  as  groups  of  entities  which  are  in  turn  groups  of 
sub-entitites,  and  so  on.  This  suggests  that  the  environments  with  which  the  human  mind  is  designed 
to  deal  exhibit  a  kind  of  modularity :  the  probability  of  a  scene  can  be  computed  by  describing  it  in 
terms  of  modules  of  various  scales  and  recursively  computing  the  probability  that  the  modules  at  one 
scale  would  be  combined  to  make  the  modules  of  the  next  larger  scale. 

Concepts,  according  to  this  intuition,  correspond  to  the  modules  in  the  environment. 
Knowledge  about  these  concepts  is  what  enables  us  to  compute  the  probabilities  of  various 
combinations  of  concepts. 

To  formally  define,  at  this  computational  level,  modular  environments,  cognitive  systems,  concepts, 
and  knowledge  is  a  major  goal  of  this  research.  At  the  moment,  however,  precise  characterizations 
corresponding  to  these  notions  exist  only  at  the  algorithmic  level.  The  next  section  describes  the 
intuitions  that  lie  between  the  algorithmic-level  description  and  a  yet -to-be  formulated 
computational-level  description. 


Computation al/Algorlthmlc  Level 

\ 

A  completion  task  can  be  performed,  it  is  assumed,  because  prior  experience  with  the 
environment  has  left  traces  of  statistical  connections  between  the  information  that  is  given  and  the 
information  that  must  be  filled  in.  The  mechanisms  which  might  maintain  such  traces  in  the  brain  are 
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mot  considered.  2  Rather,  I  shall  give  a  strictly  formal  description  of  a  cognitive  system  that  possesses 
such  traces.  This  description  presupposes  a  representation  of  environmental  scenes  within  the 
cognitive  system. 

Each  scene  in  the  environment  is  assumed  to  be  described  in  the  cognitive  system  by  a  set  of 
representation  variables  {Ru  Rt,  '  “  .  Rn)-  Each  Rl  is  taken  (for  simplicity)  to  have  value  true  or  false 
for  each  scene.  In  the  circuit  analysis  problem,  for  example,  Rt  and  Jf2  might  represent  whether  there 
is  a  change  in  the  battery's  voltage  and  current,  respectively;  each  of  these  will  be  either  true  or  false 


for  each  scene.  The  representational  variables  as  a  whole  are  assumed  to  support  representations  of 
scenes  at  all  the  levels  of  abstraction  employed  by  the  cognitive  system;  assigning  values  to  some  of  the 
variables  may  require  considerable  processing. 


The  list  of  true/false  R,  values  for  a  scene  will  be  called  its  representation  vector  R.  Each  R  has 
some  probability  in  a  given  environment.  In  the  completion  task,  some  of  the  elements  of  an  R  vector 
are  specified  as  input  (e.g.,  Ri),  and  as  output  the  system  must  give  values  for  the  remaining  elements 
(e.g.,  Rj). 

The  crucial  question  for  solving  the  completion  task  is,  what  values  of  various  R(  variables  go 
together  in  the  environment?  A  cognitive  system  must  accumulate  the  knowledge  that  answers  this 
question  as  it  experiences  a  sample  of  vectors  R.  Each  R  is  assumed  to  leave  many  traces,  which  are 
simply  copies  of  pieces  of  R.  Each  trace  records  a  single  co-ocurrence  of  the  specific  values  of  the  R, 
variables  present  in  that  trace.  After  considerable  experience  with  the  environment,  an  ensemble  of 
traces  is  built  up;  this  ensemble  implicitly  encodes  the  environment's  probability  distribution. 

In  this  paper  we  shall  not  analyte  the  important  question  of  which  pieces  of  R  are  maintained  as 
traces;  the  issue  is  considered  in  Smolensky  (1963).  Very  roughly,  the  idea  is  that  each  trace  records 
one  of  the  modules  present  in  R.  3  For  present  purposes  we  assume  that  the  set  of  traces  encoding  the 
system’s  knowledge  of  the  environment  has  been  produced  either  by  an  unspecified  training  process  or 
by  explicit  design  of  the  modeller.  In  the  electric  circuit  problem,  for  example,  the  traces  are  put  in  by 
hand;  each  one  consists  of  one  of  the  possible  instances  of  changes  of  circuit  quantities  that  are 
consistent  with  one  of  the  circuit  laws.  (For  instance,  one  of  the  traces  records  the  co-occurrence  of: 
(a)  a  decrease  in  the  current,  (b)  a  decrease  in  the  voltage  drop  across  resistor  1,  and  (c)  no  change  in 
the  resistance  of  resistor  1;  this  is  one  instance  of  Ohm’s  Law  for  resistor  1.) 

The  traces  are  encoded  as  a  set  of  trace  vectors  (Tj,  T2,  *  *  *  ,  T«).  Each  Ta  is  a  piece  of  some 
representation  vector  R,  i.e.,  a  set  of  true/false  values  for  a  subset  of  the  representation  variables.  Ta  is 
viewed  as  a  vector  of  values,  one  tor  each  Rt :  either  true,  false,  or  unspecified. 


1  A  few  remarks  on  possible  neural  implementations  of  harmony  theory  may  be  found  in  Smolensky  (1984). 

3.  A  little  more  precisely,  the  idea  is  this.  For  each  R  experienced  by  the  system,  some  mechanism  transcribes 
pieces  of  R  sad  records  them  as  traces.  After  experiencing  a  large  number  of  scenes  R,  a  large  collection  of 
traces  will  have  accumulated.  Some  traces  will  be  duplicated  many  times,  others  lees  often.  Those  frequently 
duplicated  define  the  primary  concepts  in  terms  of  which  scenes  will  be  processed  by  the  system.  Thus  the 
system's  concepts  emerge  from  the  amUtical  properties  of  the  traces.  Investigation  of  appropriate  mechanisms 
for  recording  traces  must  therefore  center  on  analyzing  the  statistical  properties  of  the  resulting  traces  in  various 
environments.  What  criteria  determine  whether  the  traces  produced  by  some  recording  mechanism  have  reason¬ 
able  statistics?  The  primary  criterion  is  the  system's  performance  on  the  completion  task  enabled  by  those 
traces.  A  rimpler  criterion  thought  to  underly  good  performance  involves  the  concept  of  harmony  to  be  dis¬ 
cussed  below;  this  training  harmony  criterion  is  considered  in  Smolensky  (1983). 


■r. -v:-: 
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Loosely,  here  is  how  completions  are  performed  using  the  traces  Ta.  The  system  checks  to  see 
which  traces  are  consistent  with  the  input;  these  become  active.  When  active  traces  specify  true/false 
values  tor  missing  information,  these  values  are  used  to  decide  how  to  fill  in  the  missing  information. 


Thus,  to  each  trace  vector  T,  is  associated  an  activation  value  Am.  The  list  of  all  activation  values 
forms  the  activation  vector  A.  For  simplicity  each  Am  will  be  taken  to  be  either  active  or  inactive ;  these 
binary  values  will  be  sufficient  for  our  purposes  (unlike  with  traditional  activation  models). 


Actually,  the  assignment  of  true/false  values  to  missing  information  is  done  in  parallel  with  the 
assignment  of  active/inactive  values  to  the  trace  variables.  To  carry  out  the  completion,  the  system 
simultaneously  performs  searches  in  the  space  of  all  vectors  A  and  in  the  space  of  all  vectors  K  that 
contain  the  true/false  values  given  by  the  input. 


The  goal  of  the  search  is  to  find  those  completions  R  that  are  highly  probable  in  the 
environment.  Intuitively,  there  is  a  relationship  between  these  R  and  the  traces  Ta.  Highly  probable 
completions  are  those  containing  combinations  of  modules  that  occur  frequently  in  the  environment. 
Such  a  completion  will  be  highly  consistent  with  many  traces.  This  suggests  using,  in  lieu  of  a  literal 
computation  of  the  probability  of  a  completion  R,  a  measure  of  its  goodness  that  counts  the  number  of 
traces  with  which  R  is  consistent,  and  the  overall  degree  of  that  consistency. 


A  convenient  representation  for  this  measure  of  goodness  incorporates  the  idea  that  traces  that 
i  consistent  with  the  input  should  be  active.  For  any  activation  vector  A  and  completion  R,  define4 


»(A4t)-2AaR-Ta 


Here  the  following  numerical  assignments  are  used;  true  •  1,  false  *  -1,  unspecified  «  0;  active  *  1, 
Inactive  -  0.  •  is  the  inner  product:  R  •  Ta  ■  £,4(7.),;  this  is  just  the  number  of  representational 
variables  R,  whose  values  agree  with  the  corresponding  values  in  Ta,  minus  the  number  that  disagree. 


H  (A  JR)  measures  how  consistent  R  is  with  the  traces  active  in  A:  H  is  called  the  Herman?  f  tmction. 
H  is  the  central  player  in  harmony  theory  because  all  the  decision-making  in  the  system  is  driven 
towards  achieving  maximal  consistency,  i.e.,  harmony.  If  a  trace  is  consistent  with  the  input,  it 
becomes  active  because  doing  so  raises  the  harmony.  If  setting  a  representational  variable  to  a  true  or 
false  value  is  consistent  with  active  traces,  that  assignment  is  made  because  it  raises  the  harmony.  In 
short,  the  space  of  activation  vectors  A  and  completions  R  is  searched  to  find  the  values  that  achieve 
high  harmony.  These  should  be  the  completions  that  are  most  probable  in  the  environment. 


An  important  goal  of  the  theory  is  to  provide  a  mathematical  characterization  of  modular 
environments  that  allows  a  proof  that  high  harmony  completions  are  high  probability  completions.  At 
the  moment,  this  identification  rests  on  intuition. 


In  a  sense,  the  traces  serve  during  search  as  mtt-conttrdnu.  The  good  solutions  are  those  that 


4.  Is  Smolensky  (1983)  and  Riley  and  Smolensky  (1984),  a  somewhat  more  complex  formula  for  H  is  used.  Ax¬ 
ioms  defining  the  properties  a  harmony  function  must  satisfy  will  be  an  important  part  of  the  formulation  of 
harmony  theory  at  the  computational  level. 
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satisfy  a  number  of  anti-constraints,  i.e.,  match  a  number  of  traces.  This  hm  many  similarities  to 
constraint-based  search,  in  which  food  solutions  me  those  that  f<dl  to  viola*  a  number  of  constraints. 
However,  the  differences  are  more  than  technical.  It  is  easy  to  see  how  a  system  can  learn 
anti-constraints:  these  are  simply  the  traces  left  by  experience,  records  that  show  what  variable  values 
earn  go  together.  It  is  not  so  easy  to  learn  what  values  comer  go  together,  when  a  'teacher*  or  reiii forcer 
is  absent.  Furthermore,  if  environmental  probability  distributions  are  sparse,  as  discussed  above,  then 
it  seems  more  feasible  to  record  Mowed  co-occurrences  than  forbidden  ones. 

Algorithmic  Level 

Having  characterized  good  completions  at  the  computational  level  (high  probability)  and  at  the 
computational/algorithmic  level  (hi^i  harmony),  it  is  time  to  specify  an  algorithm  that  will  construct 
good  completions.  The  algorithm  used  in  harmony  thoery  is  a  stochastic  one:  it  constructs  a 
completion  with  a  probability  related  to  its  harmony.  The  only  mathematically  viable  relationship 
(Smolensky,  1983)  is  the  canonical  distribution: 

prob(A,  E)  -  n 

Here  is  is  a  normalization  constant,  and  7  is  a  positive  system  parameter.  If  two  completions  have 
different  harmonies,  the  mote  harmonious  one  wfll  be  more  probable;  if  the  harmony  difference  A H  is 
large  compared  to  7 ,  then  the  ratio  of  probabilities  (eu,fT)  will  be  large.  Thus  the  greater  7 ,  the  less 
will  be  the  him  in  favor  of  the  most  harmonious  completions,  and  the  more  random  the  completions 
will  seem.  The  randomness  parameter  7  is  called  the  compmetionat  temperatnre  because  its  role  in  the 
canonical  distribution  is  identical  to  that  of  phyacal  temperature  in  the  strictly  isomorphic  canonical 
distribution  (Boltzmann  law)  of  statistical  mechanics 

Good  completions  will  be  overwhelmingly  likely  if  and  only  if  the  temperature  is  very  low.  Thus 
to  achieve  good  performance  a  low  temperature  is  needed. 

In  statistical  mechanics  there  is  a  well-known  Monte  Carlo  search  method  (the  *heat  bath 
algorithm*)  that  can  be  used  to  stochastically  explore  the  problem  space  of  vectors  A  and  K,  visiting 
points  in  the  space  with  the  probabilities  of  the  canonical  distribution  (Metropolis,  Rosenbluth, 
Rosenbluth,  Teller,  A  Teller,  1953;  Binder,  1979).  This  algorithm  starts  at  a  random  point,  and 
randomly  chooses  a  possible  direction  of  travel.  The  change  in  harmony  AJf  that  would  result  bom  a 
step  in  that  direction  is  computed;  the  decision  to  take  the  step  is  then  made  randomly,  with 
likelihood  ratio  for  taking  or  not  taking  the  step  set  equal  to  eAHfr .  'Choosing  a  posable  direction  of 
travel*  amounts  to  selecting  a  single  variable  Aa  or  A, ;  'taking  a  step*  amounts  to  changing  the  binary 
value  of  the  selected  variable. 

The  process  of  choosing  a  direction  and  deciding  whether  to  take  a  step  is  iterated.  It  can  be 
proved  that,  eventually,  the  probability  of  being  at  a  point  is  given  by  the  canonical  distribution.  The 
higher  7 ,  the  more  quickly  this  'thermal  equilibrium*  is  reached. 

The  practical  difficulty  with  this  algorithm  is  that  for  the  low  7  values  needed  to  get  good 
completions,  it  takes  an  unacceptably  long  time  to  reach  thermal  equilibrium.  A  way  to  get  to  good 
completions  faster  is  to  start  with  a  fairly  high  temperature,  and  coal  the  system  down  during  the 
computation  (Kirkpatrick,  Gelatt,  A  Vecchi,  1983).  Cooling  (or  'simulated  annealing*)  is  a  new 
computational  process  characteristic  of  thermal  models;  it  brinp  with  it  a  new  computational 
behavior  / reeling.  As  the  temperature  is  lowered,  various  system  variables  lock  in  to  values  which 
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bacon  very  reaisfsaf  to  change. 

What  is  happening  daring  the  cooling  proof  is  that  7 ,  the  scale  on  which  the  significance  of 
harsaoay  differences  is  measured,  is  getting  aaeaUcr  and  smaller.  Early  in  the  processing,  only  big 
harmony  differences  matter  the  system  avoids  only  states  of  very  low  relative  harmony;  s  very  crude 
cot  is  made  at  the  problem.  A  large  region  of  the  problem  space  is  explored.  As  the  processing 
continues,  the  system  gets  more  enacting,  narrowing  the  search  to  states  with  harmony  values  that  ate 
rinser  and  closer  to  the  nrimam  attainable  value.  This  hind  of  search  has  the  advantage  that 
whenever  its  results  ate  examined,  their  reliability  k  as  high  as  can  be  achieved  in  the  elapsed  time 
(loosely  speaking). 

A  good  theoretical  undent  an  ding  of  how  to  reflate  the  cooling  win  be  difficult  to  achieve.  At 
the  moment,  certain  techniques  allow  estimates  of  frecsing  temperatures  in  simple  situations,  but 
cooling  schedules  arc  still  largely  defined  in  ad  hoe  ways. 

The  most  natural  implementation  for  harmony  theory  uses  s  paraUd  computing  network.  This 
parallel  device,  ktrmoidim,  is  easily  simulated  on  a  serial  computer. 

To  implement  the  Monte  Carlo  search  algorithm  discussed  above,  we  set  up  one  processor  for 
each  of  the  variables  Am  and  Ilf.  According  to  the  discussion  in  the  previous  section,  the  values  of  the 
activation  processors  are  1  and  0,  while  Ac  values  for  the  representation  processors  are  +1  and  -1. 
This  set  of  values  is  not  typical  of  thermal  models. 

The  algorithm  first  involves  randoasty  picking  a  direction  in  search  space;  this  amounts  to  picking 
one  of  the  processors.  To  ensure  that  only  one  processor  changes  its  value  at  a  time,  the  processors  are 
assumed  to  take  a  random  amount  of  time  to  make  their  decisions  and  then  instantaneously  nuke  their 
change;  the  probability  of  simultaneous  changes  is  than  aero.  Once  a  change  is  made,  the  new  value 
arest  be  available  immediately.  This  type  of  aqmchrooous  updating  is  not  typical  of  activation  models, 
but  is  typical  of  thermal  models  (Hopfidd,  1982). 

The  algorithm  requires  that  to  decide  oa  its  new  value,  a  processor  must  compute  the  harmony 
change  Aif  that  would  result  from  changing  its  value.  To  perform  this  computation,  a  green  processor 
must  be  connected  to  others  in  order  to  read  their  values.  The  required  pattern  of  interconnect  ions, 
found  by  inspection  of  the  harmony  function,  is  graphically  summarised  in  Figure  2,  in  which  each 
node  is  a  processor.  Aif  for  a  given  node  is  a  weighted  sum  of  the  values  of  the  nodes  connected  to  it; 
the  weight  linking  representation  node  i  and  trace  node  a  is  (T,X,  the  lm  dement  of  the  vector  Ta. 
This  weight  applies  to  values  passed  in  ritkar  direction;  bidirectional  weights  are  characteristic  of 
thermal  models. 

The  architecture  shown  in  Figure  2  is  not  typical  of  thermal  or  activation  models.  The  purpose 
of  the  processing  is  to  set  up  the  appropriate  completion  on  the  reprttemtmUm  mod* s',  the  trace  nodes 
ssrvB  aoldy  to  mediate  between  represent  ion  nodes,  which  are  not  directly  interconnected.  It  is  useful 
ta  regard  the  representation  nodes  as  a  data  blackboard,  and  the  trace  vectors  Tm  (or  equivalently  the 
patten  of  connections)  as  the  propum,  and  the  activation  values  as  internal  program  variables. 
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AH.  -  A Am  T.« 
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A^i  *  AA|  2^.(7.), 


Figure  2.  A  graphical  repreaentation  of  hannoninm.  The  nodes  denote  stochastic  processors,  and 
the  links  denote  communication  lines. 
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In  particular  it  should  be  noted  that  the  traditional  hierarchical  picture  of  successive  vertical 
layers  of  nodes  with  layer-to-Iayer  vertical  connections  does  not  cmst  here.  One  can  visualize  the 
representational  nodes  being  laid  out  with  abstract  representational  variables  to  the  right  and  concrete 
variables  to  the  left.  Input  at  the  leftmost  nodes  activates  traces  that  are  connected  to  them;  these 
traces  are  also  connected  to  slightly  more  abstract  representational  nodes,  which  are  then  assigned 
values  consistent  with  the  active  traces.  This  in  turn  activates  new  traces,  and  so  it  goes.  Decision¬ 
making  passes  rightward,  bouncing  back  and  forth  between  the  lower  to  upper  layers.  In  place  of  a 
pro- wired,  rigid  vertical  hierarchical  architecture  is  a  fluid  horizontal  architecture  that  can  implement  a 
hierarchy  when  appropriate.  (In  fact,  the  statistical  properties  of  the  participation  of  'nodes  on  the 
right*  in  the  traces  that  accumulate  during  experience  with  an  environment  are  what  determine  the 
'abstract  concepts*  that  dynamically  evolve  in  that  environment.) 

Once  a  node  has  computed  the  difference  in  harmony  A H  between  its  two  possible  states,  the 
likelihood  ratio  for  adopting  its  two  states  is  tu,fT .  Converting  this  to  the  absolute  probability  of 
changing  value  gives  the  result  shown  in  Figure  3.  This  sort  of  sigmoidal  relation  between  the  weighted 
sum  of  the  inputs  to  a  node  and  its  decision  is  common  in  activation  models;  however  two  differences 
should  be  noted.  First,  in  activation  models  the  ordinate  of  Figure  3  would  be  a  continuous  node 
value;  here,  it  is  a  probability  for  a  discrete  node  value.  Second,  the  slope  of  the  sigmoidal  curve  at  the 
origin,  1 fT ,  is  not  fixed;  it  increases  as  the  computation  proceeds. 

The  processing  features  visible  at  the  implementation  level — i.e.,  the  defining  properties  of 
harmonium — are  all  strict  consequences  of  the  algorithmic-lewd  analysis  of  harmony  theory.  Ongoing 
development  of  the  theory  is  aimed  at  filling  the  logical  and  empirical  gaps  Unking  the  analysis  at  the 
algorithmic  level  to  those  at  higher  levels,  and  gaining  experience  applying  harmonium  to  specific 
cognitive  tasks. 


raw" 


■  ww itw  r  ■«  ■  »v  ? ».M  VI  »  V»  gv.V^A*» W*v  V*  '--1  WA  'A'.T>  V*  'A  A  T1  'A  '.-•  '.'•  '-••  'A 'Ay  T->  r~  V*  ’  '.t  .■* 


11 


Harmony  Theory 


-  AH 


f 


Fifure  3.  The  sigmoidal  relation  between  the  weighted  ram  of  inputs  to  a  harmonium  node  (AH) 
and  the  probability  that  the  node  changes  its  value. 
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